Prerequisites

What will be done

Data analysis and preparation

We analyze 39 autoimmune disease- and trait-associated SNP sets, obtained from the Supplemental table 1 of the Farh, K. K.-H., Marson, A., Zhu, J., Kleinewietfeld, M., Housley, W. J., Beik, S., … Bernstein, B. E. (2014). “Genetic and epigenetic fine mapping of causal autoimmune disease variants”“ Nature. doi:10.1038/nature13835.

We test each disease- and trait-associated SNP set for enrichment in cell/tissue-specific histone modification marks (broadPeak) obtained from the Roadmap Epigenomics project. The results of this analysis are stored in the data/Roadmap_broadPeak2/ folder.

Set up the environment

suppressMessages(source("utils2.R"))  # See the required packages there
suppressMessages(source("episimilarity.R"))

We first load annotation data containing disease name-category mappings. We have four categories of the diseases/traits - “immunological”, “neurological”, “hematological” and “metabolic”.

We load the matrix of shrunken odds ratios using the ‘load_gr_data’ function. The ‘shrunken’ part refers to the fact that each odds ratio is conservatively shrunken to the lowes boundary of its confidence interval. If a confidence interval overlaps 1, the odds ratio is set to 1. The shrunken odds ratios are log2-transformed.These transformation steps will make the distribution of shrunken odds ratios centered around 0, and safeguard against +/-Inf values. The smaller/larger numbers will correspond to more signficant depletions/enrichments, respectively.

library(xlsx)
# Load term mapping
term.mapping <- read.xlsx2("data/icd9_mapping.xlsx", sheetName = "manual")
# Load the data
mtx <- load_gr_data("data/Roadmap_broadPeak2/matrix_OR.txt")
mtx <- mtx[, match(term.mapping$BED, colnames(mtx))]  # Match and subset the data
# A vector of category names, same order as the correlation matrix
categoryNames <- term.mapping$Category[match(colnames(mtx), term.mapping$Name)]
names(categoryNames) <- term.mapping$Name[match(colnames(mtx), term.mapping$Name)]
# Set side colors
ColSideColors <- as.numeric(factor(categoryNames[colnames(mtx)]))
cbind((categoryNames), ColSideColors)  # Sanity check
ColSideColors[ColSideColors == 1] <- "red"  # Immunological
ColSideColors[ColSideColors == 2] <- "blue"  # Metabolic
ColSideColors[ColSideColors == 3] <- "green"  # Neurological
ColSideColors[ColSideColors == 4] <- "yellow"  # Trait

We first check how the distributions of the log2-transformed odds ratios for each sample look like side-by-side.

ggplot(melt(mtx), aes(x = Var2, y = value, fill = Var2)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + theme(legend.position = "none")

# coord_cartesian(ylim = c(-5, 5)) +

It’s a good idea to scale and center the data. Let’s check how the scaled data looks like.

mtx <- scale(mtx)
ggplot(melt(mtx), aes(x = Var2, y = value, fill = Var2)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + theme(legend.position = "none")

# coord_cartesian(ylim = c(-5, 5)) +

Visualizing epigenomic similarity results

Epigenomic similarity analysis groups SNP sets by comparing their epigenomic enrichemt profiles (sets of log2-transformed shrunken odds-ratios). We scale the matrix, and compare the columns (SNP set-specific profiles) using “euclidean” distance.

# 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary' or 'minkowski'
dist.method <- "euclidean"
d <- dist(t(mtx), method = dist.method)  # Get distance matrix
# 'ward.D', 'ward.D2', 'single', 'complete', 'average' (= UPGMA), 'mcquitty' (= WPGMA), 'median' (= WPGMC) or 'centroid' (= UPGMC)
hclust.method <- "ward.D2"
hclustergram <- hclust(d, method = hclust.method)  # Cluster it

We can also evaluate clustering stability using the ‘pvclust’ package.

library(pvclust)
library(parallel)
cl <- makeCluster(4, type = "PSOCK")
result <- parPvclust(cl = cl, mtx, method.dist = dist.method, method.hclust = hclust.method, nboot = 1000, iseed = 1)
stopCluster(cl)
# saveRDS(result, 'data/pvClust_10000.Rds') result <- readRDS('data/pvClust_10000.Rds')
dev.off()
plot(result)
pvrect(result, alpha = 0.9, type = "leq", max.only = TRUE, lwd = 3)
hclustergram <- result$hclust  # Make the heatmap plot whatever hclust object
Multiscale bootstrap... Done.
null device 
          1 

And, we can visualize this clustering using the ‘aheatmap’ function from the ‘NMF’ package.

suppressMessages(library(NMF))
annot <- data.frame(Category = term.mapping$Category)  #, Size=as.numeric(term.mapping$BEDcount))
annotColor <- list(Category = c("red", "blue", "green", "yellow"))  #, Size=c('red', 'white', 'blue'))
h <- aheatmap(as.matrix(d), Rowv = as.dendrogram(hclustergram), Colv = as.dendrogram(hclustergram), color = (c("red", "yellow", "blue")), distfun = dist.method, hclustfun = hclust.method, annCol = annot, annColors = annotColor, cexCol = 1, cexRow = 1)

We can also estimate similarity among the epigenomic enrichment profiles using Spearman correlation coefficient

# rcorr returns a list, [[1]] - correl coeffs, [[3]] - p-values. Type - pearson/spearman
mtx.cor <- rcorr(as.matrix(mtx), type = "spearman")[[1]]
# Optionally, try Spearman or Kendall correlation mtx.cor[[1]] <- cor(as.matrix(mtx), method='kendall')

The NxN square matrix of correlation coefficients, where N is the count of SNP sets, is visualized as a clustered heatmap. The distance and clustering parameters can be tweaked. By default, “euclidean” distance and “ward.D2” similarity metrics are used. The ‘mtx.plot’ function is a wrapper for plotting the clustered heatmap of correlation coefficients.

mtx.plot(mtx.cor, SideColors = ColSideColors)

Epigenomic differences among the clusters (Differential enrichment analysis)

The heatmap object contains information about the clustering. This information can be visualized as a dendrogram, cut into separate clusters defined by the cut height (user defined). The cluster ordering can be saved.

# Empirically define clusters
par(mfrow = c(1, 2))
plot(as.hclust(h$Colv), cex = 0.3)  # Plot dendrogram
cl_num <- 5  # Empirically set desired numter of clusters
cols <- rainbow(cl_num)  # Make colors for clusters
rect.hclust(as.hclust(h$Colv), k = cl_num, border = cols)  # Define the clusters by rectangles
hcut <- 25  # Empirically set height to cut the tree
abline(h = hcut)
as.hclust(h$Colv)$height %>% density %>% plot
abline(v = hcut)

mtx.clust <- h$Colv %>% mtx.clusters(height = hcut, minmembers = 3)
Cluster01 has  12 members 
Cluster02 has   8 members 
Cluster03 has   3 members 
Cluster04 has   7 members 
Cluster05 has   9 members 
par(mfrow = c(1, 1))
# Save the results of clustering
write.table(as.data.frame(mtx.clust), "results/clustering_all.txt", sep = "\t", row.names = FALSE, quote = FALSE)

‘mtx.clust’ object now contains the cluster groups and labels. By default, if the number of SNP sets in each cluster is less than 3, such cluster is not considered for the differential enrichment analysis.

‘mtx.degfs’ function takes a matrix of odds ratios and compares their distributions between the clusters. Thus, for each GF, odds ratios of FOIs in one cluster are compared with odds ratios of FOIs in another cluster using Welch t-test. The results are outputted into ‘results/degfs_LABEL.xlsx’.

# A matrix to use for differential epigenomic analysis
mtx <- load_gr_data("data/Roadmap_broadPeak2/matrix_OR.txt")
mtx.degfs(mtx[, mtx.clust$eset.labels], mtx.clust, label = "broadPeak2")

Cell type-specific analysis

The cell type specific analysis evaluates cell/tissue types with epigenomic marks most frequently and most significantly enriched with a given SNP set. The results are outputted into an Excel file, with each worksheet containing results for a specific SNP set.

mtx.cellspecific(mtx, "results/cellspecific_broadPeak2.xlsx")